2022-04-26

Missing value analysis

What are missing values

In R, missing observations are represented by NA.

Missing data can affect data summaries, analyses, and the conclusions drawn from data with missing values.
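
A small sketch of why this matters, using a made-up vector `x` (an assumption, not part of the example data): by default a single NA propagates through a summary such as the mean, and the result only becomes a number once the missing value is explicitly dropped.

```r
# Hypothetical toy vector with one missing value
x <- c(2, NA, 4)

# is.na() flags the missing element
is.na(x)

# By default the NA propagates, so the mean itself is NA
m_default <- mean(x)

# Dropping the NA gives the mean of the observed values only
m_observed <- mean(x, na.rm = TRUE)
```

Whether dropping missing values is appropriate depends on the analysis; the point here is only that the choice changes the summary.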

Amount of missing data

  • Matrix perspective
  • Variable perspective
  • Case perspective

Example data

The example data has 25 rows and 5 columns.

head(datm, 15)
##             X1           X2          X3          X4         X5
## 1  -1.54231016           NA -1.53963888 -1.00463582         NA
## 2   0.12566899  0.299850333  0.70981759 -0.36098509 -0.2001262
## 3  -0.36613648 -0.919290814  0.67424464 -0.12020154 -0.3580860
## 4           NA  0.163966085 -0.35275879          NA         NA
## 5   1.15997488  0.033070842 -0.45699570  1.35991710  0.3961639
## 6           NA  0.461028375  0.03532637          NA         NA
## 7   0.21913719  0.090383991 -0.37945894  0.28158872  0.4755253
## 8  -0.07999332           NA  0.10230557 -0.67126265         NA
## 9  -0.09104919           NA -0.93235947  0.02538092         NA
## 10 -0.31494673 -0.007399564 -0.65943985  0.34740561 -0.2464995
## 11 -0.14943881  0.527039396  0.89058224  0.29186606  0.2958048
## 12 -0.18526418 -1.426163995  1.07974150 -0.21383023 -0.6432199
## 13          NA  0.682816262  0.43148781          NA         NA
## 14  0.03554797           NA  0.16174751  0.28501286         NA
## 15  0.06089833           NA -1.24165935 -0.39956247         NA

Matrix perspective

Matrix perspective: the number of missing entries in the data matrix.

The is.na function returns TRUE if a cell is missing (NA) and FALSE if a cell is observed.

In the example there are 24 missing data entries. The data frame contains 5 variables for 25 subjects, which makes a total of 125 data entries. So, 19.2% of the data entries are missing.

sum(is.na(datm))
## [1] 24
sum(is.na(datm))/length(is.na(datm))
## [1] 0.192
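
Because `is.na` returns TRUE/FALSE values, the proportion can also be taken as a mean. The sketch below uses a made-up data frame `toy` (hypothetical, standing in for `datm`) with 3 missing entries out of 12.

```r
# Hypothetical toy data: 4 rows, 3 columns, 3 missing entries
toy <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c(NA, NA, 6, 7),
  v3 = c(8, 9, 10, 11)
)

# Number of missing entries and total number of entries
n_missing <- sum(is.na(toy))
n_entries <- prod(dim(toy))

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion missing
p_missing <- mean(is.na(toy))
```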

Variable perspective

Variable perspective: the number of missing values per variable.

For each variable we can count the number of missing observations (n) and calculate the proportion (p).

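library(dplyr)  # %>%, summarise_all
library(tidyr)  # pivot_longer

datm %>%
  is.na() %>%
  data.frame() %>%
  # per column: n = number missing, p = proportion missing
  summarise_all(list(n = sum, p = mean)) %>%
  # reshape X1_n, X1_p, ... into one row per variable
  pivot_longer(everything(), 
               names_to = c("variable", ".value"),
               names_pattern = "(.*)_(.)")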
## # A tibble: 5 x 3
##   variable     n     p
##   <chr>    <int> <dbl>
## 1 X1           4  0.16
## 2 X2           6  0.24
## 3 X3           0  0   
## 4 X4           4  0.16
## 5 X5          10  0.4
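
The same per-variable counts can be sketched in base R with `colSums` and `colMeans`. The data frame `toy` is made up for illustration and is not the example data.

```r
# Hypothetical toy data: v1 has 1 missing value, v2 has 2, v3 has none
toy <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c(NA, NA, 6, 7),
  v3 = c(8, 9, 10, 11)
)

# Column sums of the TRUE/FALSE matrix give counts, column means give proportions
per_variable <- data.frame(
  variable = names(toy),
  n = colSums(is.na(toy)),
  p = colMeans(is.na(toy)),
  row.names = NULL
)
per_variable
```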

Case perspective

Case perspective: the number of rows, i.e. cases, with missing values.

Many analysis methods only use the rows that are fully observed: complete-case analysis.

Rows with any missing value are then dropped from the analysis, which is known as listwise deletion.

datm %>% 
  is.na() %>%
  data.frame() %>%
  # count the missing values in each row, then label the row
  mutate(n_miss = rowSums(.),
         missing = ifelse(n_miss > 0, "rows with missing", "rows without missing")) %>%
  group_by(missing) %>%
  summarise(n = n(),
            p = n / nrow(datm))
## # A tibble: 2 x 3
##   missing                  n     p
##   <chr>                <int> <dbl>
## 1 rows with missing       10   0.4
## 2 rows without missing    15   0.6
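
To illustrate complete-case analysis, the sketch below fits a linear model to a made-up data frame `toy` (hypothetical values, not the example data). With `na.action = na.omit`, `lm` silently drops the incomplete rows, so the model is fitted on fewer rows than the data contain.

```r
# Hypothetical toy data: 6 rows, 2 of them with a missing predictor
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 6.0),
  x = c(1, 2, NA, 4, 5, NA)
)

# na.omit is stated explicitly here; it is also R's usual default
fit <- lm(y ~ x, data = toy, na.action = na.omit)

# Rows actually used by the model versus rows dropped
n_used <- nobs(fit)
n_dropped <- nrow(toy) - n_used
```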

Case perspective - mice

  • cci: create a logical indicator that is TRUE for every fully observed row (complete case) and FALSE otherwise.
mice::cci(datm)
##  [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [13] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## [25]  TRUE
  • nic: count the number of incomplete cases, i.e. cases with missing values.
mice::nic(datm)
## [1] 10
  • ncc: count the number of complete cases, i.e. fully observed rows.
mice::ncc(datm)
## [1] 15
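
For a data frame, similar quantities can be sketched in base R with `complete.cases`. The data frame `toy` is made up for illustration, and the correspondence with the mice helpers is stated here as an assumption rather than taken from the mice source.

```r
# Hypothetical toy data: rows 1 and 2 are incomplete, rows 3 and 4 are complete
toy <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c(NA, NA, 6, 7),
  v3 = c(8, 9, 10, 11)
)

# Logical indicator per row (the role cci plays above)
is_complete <- complete.cases(toy)

# Counts of complete and incomplete cases (the roles of ncc and nic)
n_complete <- sum(is_complete)
n_incomplete <- sum(!is_complete)
```

The two counts always add up to the number of rows.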

Missing data patterns

Missing data pattern: the combination of observed and unobserved values that occur together in a row. A pattern is usually written with a 1 for an observed value and a 0 for a missing value.

A data set often contains several different missing data patterns. The example shows three missing data patterns:

  1. All variables are observed, so a row of only ones.
  2. Three variables observed and two missing.
  3. Two variables observed and three missing.
mice::md.pattern(datm, plot = FALSE)
##    X3 X1 X4 X2 X5   
## 15  1  1  1  1  1  0
## 6   1  1  1  0  0  2
## 4   1  0  0  1  0  3
##     0  4  4  6 10 24

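Row names: the number of times the pattern occurs in the data. Last column: the number of missing values in the pattern. Last row: the number of missing values per variable, with the total number of missing entries in the bottom-right corner.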
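
The idea of a pattern can be sketched by hand: turn each row into a string of 1s and 0s and tabulate the strings. The data frame `toy` is made up for illustration; this is not how `md.pattern` is implemented, only a way to see what it counts.

```r
# Hypothetical toy data with three distinct patterns
toy <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c(NA, NA, 6, 7),
  v3 = c(8, 9, 10, 11)
)

# 1 = observed, 0 = missing
r <- 1 * !is.na(toy)

# One pattern string per row, e.g. "101" = v1 observed, v2 missing, v3 observed
pattern <- apply(r, 1, paste, collapse = "")

# How often each pattern occurs
pattern_counts <- table(pattern)
pattern_counts
```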

Missing data pairs

Missing data pair: for each pair of variables, the number of cases in which both are observed, both are missing, or one is observed while the other is missing.

These counts show how many cases we can actually use for imputation. The md.pairs function from the mice package returns four matrices. Each matrix gives us information about combinations of missing values in our data.

  • response-response (rr): the count of how often two variables are both observed.
  • response-missing (rm): the count of how often the row-variable is observed and the column-variable is missing.
  • missing-response (mr): the count of how often the row-variable is missing and the column-variable is observed.
  • missing-missing (mm): the count of how often two variables are both missing.
pat <- mice::md.pairs(datm)
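
The four counts can be sketched by hand with matrix products of the response indicator. The data frame `toy` and the object names below are made up for illustration.

```r
# Hypothetical toy data
toy <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c(NA, NA, 6, 7),
  v3 = c(8, 9, 10, 11)
)

# Response indicator: TRUE = observed, FALSE = missing
r <- !is.na(toy)

# Rows of each result are the row-variable, columns the column-variable
rr <- t(r) %*% r      # both observed
rm <- t(r) %*% !r     # row-variable observed, column-variable missing
mr <- t(!r) %*% r     # row-variable missing, column-variable observed
mm <- t(!r) %*% !r    # both missing

# For every pair the four counts add up to the number of rows
total <- rr + rm + mr + mm
```

Note that `rm` here shadows the base function of the same name only as a variable; the name is kept to match the labels used above.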

Response-response

Counts of cases in which both variables are observed.

pat$rr
##    X1 X2 X3 X4 X5
## X1 21 15 21 21 15
## X2 15 19 19 15 15
## X3 21 19 25 21 15
## X4 21 15 21 21 15
## X5 15 15 15 15 15

Response-missing

Counts of cases in which the row-variable is observed and the column-variable is missing.

pat$rm
##    X1 X2 X3 X4 X5
## X1  0  6  0  0  6
## X2  4  0  0  4  4
## X3  4  6  0  4 10
## X4  0  6  0  0  6
## X5  0  0  0  0  0

Missing-response

Counts of cases in which the row-variable is missing and the column-variable is observed.

pat$mr
##    X1 X2 X3 X4 X5
## X1  0  4  4  0  0
## X2  6  0  6  6  0
## X3  0  0  0  0  0
## X4  0  4  4  0  0
## X5  6  4 10  6  0

Missing-missing

Counts of cases in which both variables are missing.

pat$mm
##    X1 X2 X3 X4 X5
## X1  4  0  0  4  4
## X2  0  6  0  0  6
## X3  0  0  0  0  0
## X4  4  0  0  4  4
## X5  4  6  0  4 10

Information for imputation

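Dividing the missing-response matrix by the sum of the missing-response and missing-missing matrices gives, for each pair of variables, the percentage of cases with the row variable missing in which the column variable is observed. These are the usable cases for imputing the row variable from the column variable.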

round(100 * pat$mr / (pat$mr + pat$mm))
##     X1  X2  X3  X4  X5
## X1   0 100 100   0   0
## X2 100   0 100 100   0
## X3 NaN NaN NaN NaN NaN
## X4   0 100 100   0   0
## X5  60  40 100  60   0

The X3 row is NaN because X3 has no missing values: both counts are 0, so the division is 0/0.
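
As a worked check, the X5 row can be recomputed from the X5 rows of the pat$mr and pat$mm output printed earlier. The vector names `mr_x5` and `mm_x5` are made up for this sketch.

```r
# X5 rows copied from the pat$mr and pat$mm output above
mr_x5 <- c(X1 = 6, X2 = 4, X3 = 10, X4 = 6, X5 = 0)
mm_x5 <- c(X1 = 4, X2 = 6, X3 = 0,  X4 = 4, X5 = 10)

# X5 is missing in mr + mm = 10 cases for every column variable
n_x5_missing <- mr_x5 + mm_x5

# Percentage of those cases in which the column variable is observed
usable_x5 <- round(100 * mr_x5 / n_x5_missing)

# A variable without missing values gives 0 / 0, which R reports as NaN
no_missing <- 0 / 0
```

So X3 is observed in all 10 cases where X5 is missing, while X2 is observed in only 4 of them.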